Modeling a temporal process as if it is Markovian assumes the present encodes all of the process’s
history. When this occurs, the present captures all of the dependency between past and future. We 
recently showed that if one randomly samples in the space of structured processes, this is almost never 
the case. So, how does the Markov failure come about? That is, how do individual measurements 
fail to encode the past? And, how many are needed to capture dependencies between the past and 
future? Here, we investigate how much information can be shared between the past and future, 
but not be reflected in the present. We quantify this elusive information, give explicit calculational 
methods, and draw out the consequences. The most important of these is that, when the present
hides past-future dependency, we must move beyond sequence-based statistics and build state-based
models.
I. INTRODUCTION

Until the turn of the twentieth century, temporal processes were
almost exclusively considered to be independently sampled at each
time from the same statistical distribution. These studies were
initiated by Jacob Bernoulli in the 1700s [1] and refined by Simeon
Poisson [2] and Pafnuty Chebyshev [3] in the 1800s, leading to the
weak Law of Large Numbers and the Central Limit Theorem. These
powerful results were the first hints at universal laws in stochastic
processes, but they applied only to independent, identically
distributed (IID) processes: unstructured processes with no temporal
correlation, no memory. Moreover, until the turn of the century it
was believed that these laws required independence. It fell to Andrei
Andreevich Markov (1856-1922) to realize that independence is not
necessary. To show this he introduced a new kind of sequence or
“chain” of dependent random variables, along with the concepts of
transition probabilities, irreducibility, and stationarity.

Introducing his “complex chains” in 1907, Markov initiated the modern
study of structured, interdependent, and correlated processes. Indeed,
in the first and now-famous application of complex chains, he analyzed
the pair distribution (2-grams) in the 20,000 vowels and consonants in
Pushkin’s poem Eugeny Onegin and the 100,000 letters in Aksakov’s
novel The Childhood of Bagrov, the Grandson. Since Markov’s time the
study of complex chains has developed into one of the most powerful
and widely applied mathematical theories, reaching far beyond
quantitative linguistics to physics, biology, and finance.

Here, we take an information-theoretic view of Markovian complexity
arising from temporal interdependency between observed symbols that
are not themselves the chain states. Specifically, we consider
stationary, ergodic processes generated by hidden Markov chains
(HMCs), introduced in the mid-twentieth century as a generalization of
Markov’s chains necessary to model processes generated by
communication channels [8]. When are these hidden processes described
by finite Markov chains? When are they not Markovian? What is the
informational signature in this case? And, what are “states” in the
first place? Can we discover them from observations of a hidden
process?

The following is the first in a series that addresses these questions:
which have been answered, which can be answered, and which are open.
Here, we concentrate on how the present, a sequence of $\ell$
consecutive measurements, statistically shields the past from the
future, introducing the elusivity $\sigma_\mu^\ell$ as a quantitative
measure of shielding. We show how to calculate it explicitly and then
describe and interpret its behavior (and that of related measures)
through examples. As an application we use the results to reinterpret
the persistent mutual information, introduced previously as a measure
of “emergence” in complex systems. The sequel is analytical, giving
closed-form solutions and proving various properties, including
several of those used here.

The next section reviews the minimal necessary background of
information theory, computational mechanics, and a recent analysis of
information in the context of the past and future. We then give our
main new result, which expresses the elusivity in terms of a process’s
causal states. This leads to simple and efficient expressions, the
basis for further analytical development and empirical estimations. We
illustrate the elusivity for a number of prototype processes and
finally compare it to other information measures. We close with a
discussion of the results, drawing conclusions for future
applications.


II. INFORMATION IN COMPLEX PROCESSES 

A. Processes 

We are interested in a general stochastic process $\mathcal{P}$: the
distribution of all of a system’s behaviors or realizations
$\{\ldots, x_{-2}, x_{-1}, x_0, x_1, \ldots\}$ as specified by their
joint probabilities $\Pr(\ldots, X_{-2}, X_{-1}, X_0, X_1, \ldots)$.
$X_t$ is a random variable that is the outcome of the measurement at
time $t$, taking values $x_t$ from a finite set $\mathcal{A}$ of all
possible events. We denote a contiguous chain of random variables as
$X_{0:\ell} = X_0 X_1 \cdots X_{\ell-1}$. Left indices are inclusive;
right, exclusive. We suppress indices that are infinite. We consider
only stationary processes, for which
$\Pr(X_{t:t+\ell}) = \Pr(X_{0:\ell})$ for all $t$ and $\ell$.

Our particular emphasis in the following is that a process
$\Pr(X_{:0}, X_{0:\ell}, X_{\ell:})$ is a communication channel that
transfers information from the past
$X_{:0} = \ldots X_{-3} X_{-2} X_{-1}$ to the future
$X_{\ell:} = X_\ell X_{\ell+1} X_{\ell+2} \ldots$ by storing parts of
it in the present $X_{0:\ell} = X_0 X_1 \ldots X_{\ell-1}$ of length
$\ell$. Of primary concern is whether
$X_{:0} \to X_{0:\ell} \to X_{\ell:}$ forms a Markov chain in the
sense of Ref. [2]:

$$\Pr(X_{-m:0}, X_{0:\ell}, X_{\ell:n}) = \Pr(X_{-m:0} | X_{0:\ell}) \Pr(X_{\ell:n} | X_{0:\ell}) \Pr(X_{0:\ell}) ,$$

for all $m, n \in \mathbb{Z}^+$.


B. Channel Information 

In analyzing this channel we need to measure the various forms of
information being communicated. The simplest is Shannon entropy:

$$H[X] = -\sum_{x \in \mathcal{X}} \Pr(x) \log_2 \Pr(x) . \tag{1}$$

Three other information-theoretic measures based on the entropy will
be employed throughout. First, the conditional entropy, measuring the
amount of information remaining in a variable $X$ (alphabet
$\mathcal{X}$) once the information in a variable $Y$ (alphabet
$\mathcal{Y}$) is accounted for:

$$H[X|Y] = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} \Pr(x, y) \log_2 \Pr(x|y) . \tag{2}$$

Second, the deficiency of the conditional entropy relative to the full
entropy is known as the mutual information, characterizing the
information that is contained in both $X$ and $Y$:

$$I[X:Y] = H[X] - H[X|Y] = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} \Pr(x, y) \log_2 \frac{\Pr(x, y)}{\Pr(x) \Pr(y)} . \tag{3}$$

Last, we have the conditional mutual information, the mutual
information between two variables once the information in a third
($Z$ with alphabet $\mathcal{Z}$) has been accounted for:

$$I[X:Y|Z] = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y},\, z \in \mathcal{Z}} \Pr(x, y, z) \log_2 \frac{\Pr(x, y|z)}{\Pr(x|z) \Pr(y|z)} . \tag{4}$$
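As a rough illustration of Eqs. (1) to (4), the following sketch computes the four measures from a finite joint distribution stored as a dictionary. The function names, the axis-tuple convention, and the XOR toy distribution are assumptions made for this example only; they do not come from the original text. The conditional quantities are computed through the equivalent entropy differences rather than the explicit sums.

```python
from collections import defaultdict
from math import log2

def marginal(joint, axes):
    """Marginalize {outcome_tuple: prob} onto the coordinates listed in axes."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[a] for a in axes)] += p
    return dict(out)

def entropy(joint, axes):
    """Shannon entropy in bits of the chosen coordinates, Eq. (1)."""
    return -sum(p * log2(p) for p in marginal(joint, axes).values() if p > 0)

def conditional_entropy(joint, x, y):
    """H[X|Y] = H[X,Y] - H[Y], equivalent to Eq. (2)."""
    return entropy(joint, x + y) - entropy(joint, y)

def mutual_information(joint, x, y):
    """I[X:Y] = H[X] - H[X|Y], Eq. (3)."""
    return entropy(joint, x) - conditional_entropy(joint, x, y)

def conditional_mutual_information(joint, x, y, z):
    """I[X:Y|Z] = H[X|Z] + H[Y|Z] - H[X,Y|Z], equivalent to Eq. (4)."""
    return (conditional_entropy(joint, x, z)
            + conditional_entropy(joint, y, z)
            - conditional_entropy(joint, x + y, z))

# Hypothetical toy distribution: X and Y are independent fair bits, Z = X XOR Y.
xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

print(mutual_information(xor, (0,), (1,)))                    # X and Y alone
print(conditional_mutual_information(xor, (0,), (1,), (2,)))  # X and Y given Z
```

In this toy case the two inputs share no information on their own but share one bit once the third variable is known, a reminder that conditioning can change the dependency between two variables in either direction.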


Perhaps the most naive way of information-theoretically analyzing a
process, capturing the randomness and dependencies in sequences of
random variables, is via the block entropies:

$$H(\ell) = H[X_{0:\ell}] . \tag{5}$$

This quantifies the amount of information in a contiguous block of
observations. Its growth with $\ell$ gives insight into a process’s
randomness and structure [15]:

$$H(\ell) \approx E + h_\mu \ell , \quad \ell \gg 1 . \tag{6}$$

The asymptotic growth $h_\mu$, here, is a process’s rate of
information generation, or the Shannon entropy rate:

$$h_\mu = H[X_0 | X_{:0}] . \tag{7}$$

And, the amount of future information predictable from the past is the
past-future mutual information or excess entropy:

$$E = I[X_{:0} : X_{0:}] = H[X_{:0}] - H[X_{:0} | X_{0:}] . \tag{8}$$

The excess entropy naturally arises when considering channels with a
length $\ell = 0$ present, where it is effectively the only direct
information quantity over the variables $X_{:0}$ and $X_{0:}$. It is
well known that if the excess entropy vanishes, then there is no
information temporally communicated by the channel.

Generically, Eq. (8) is of the form $\infty - \infty$, which is
meaningless. In such situations one refers to finite sequences and
then takes a limit:

$$\lim_{m, n \to \infty} \left( H[X_{-m:0}] - H[X_{-m:0} | X_{0:n}] \right) .$$

Here, we generally use the informal infinite variables in equations
for clarity and simplicity unless the details of the limit are
important for the analysis at hand. To be concrete, we write
$f(X_{:0})$ to mean $\lim_{m \to \infty} f(X_{-m:0})$ and
$f(X_{\ell:})$ to mean $\lim_{n \to \infty} f(X_{\ell:n})$.

C. Information Atoms 

Our goal here is to analyze a process as a channel as a function of
the present’s length $\ell$. The cases of $\ell = 0$ and $\ell = 1$
have already been addressed: $\ell = 0$ in earlier work and
$\ell = 1$ in Ref. [13]. Our development closely mirrors theirs. We
borrow notation, but must include a superscript to denote the
$\ell$-dependence of the quantities. Our immediate concern is that of
monitoring the amount of dependency remaining between the past and
future if the present is known. We use the mutual information between
the past and the future conditioned on the present to do so. This is
the elusivity that will soon become our focus:

$$\sigma_\mu^\ell = I[X_{:0} : X_{\ell:} | X_{0:\ell}] = H[X_{:0} | X_{0:\ell}] + H[X_{\ell:} | X_{0:\ell}] - H[X_{:0}, X_{\ell:} | X_{0:\ell}] . \tag{9}$$

Note that $\sigma_\mu^0 = E$.

Next, again following that earlier work, we decompose the
length-$\ell$ present. When considering only the past, the information
in the present separates into two components:
$\rho_\mu^\ell = I[X_{:0} : X_{0:\ell}]$, the information that can be
anticipated from the past, and $h_\mu^\ell = H[X_{0:\ell} | X_{:0}]$,
the random component that cannot be anticipated. Naturally,
$H[X_{0:\ell}] = h_\mu^\ell + \rho_\mu^\ell$. Connecting directly to
Ref. [13], our $\rho_\mu^1$ is their $\rho_\mu$ and, likewise, our
$h_\mu^1$ is their $h_\mu$.

If one also accounts for the future’s behavior, then the random,
unanticipated component $h_\mu^\ell$ breaks into two kinds of
information: one part, $b_\mu^\ell = I[X_{0:\ell} : X_{\ell:} | X_{:0}]$,
that, while having some degree of randomness, is relevant for
predicting the future; and the remaining part,
$r_\mu^\ell = H[X_{0:\ell} | X_{:0}, X_{\ell:}]$, that is ephemeral,
existing only fleetingly in the present and then dissipating, leaving
no trace on future behavior.

The redundant portion $\rho_\mu^\ell$ of $H[X_{0:\ell}]$ itself splits
into two pieces. The first part, $I[X_{:0} : X_{0:\ell} | X_{\ell:}]$
(also equal to $b_\mu^\ell$ when the process is stationary), is shared
between the past and the current observation, but its relevance stops
there. The second piece,
$q_\mu^\ell = I[X_{:0} : X_{0:\ell} : X_{\ell:}]$, is anticipated by
the past, is present currently, and also plays a role in future
behavior. Notably, this informational piece can be negative.

Due to a duality between set-theoretic and information-theoretic
operators, we can graphically represent the relationship between these
various informations in a Venn-like display called an information
diagram; see Fig. 1. The display resembles a Venn diagram, except that
size indicates Shannon entropy rather than set cardinality and
overlaps are not set intersection, but mutual information. Each area
on the diagram represents one or another of Shannon’s information
measures.


FIG. 1. The process information diagram that places the present in its
temporal context: the past ($X_{:0}$) and the future ($X_{\ell:}$)
partition the present ($X_{0:\ell}$) into four components with
quantities $r_\mu^\ell$, $q_\mu^\ell$, and two with $b_\mu^\ell$.
Notably, the component $\sigma_\mu^\ell$, quantifying the hidden
dependency shared by the past and the future, is not part of the
present.


As mentioned above, the past splits $H[X_{0:\ell}]$, yielding two
pieces: $h_\mu^\ell$, the part outside the past, and $\rho_\mu^\ell$,
the part inside. This partitioning arises naturally when predicting a
process. To emphasize, Fig. 2a displays this decomposition. If we
include the future in the diagram, we obtain a more detailed
understanding of how information is transmitted from the past to the
future. The past and the future together divide the present
$H[X_{0:\ell}]$ into four parts, as shown in Fig. 2b.

FIG. 2. Alternative decompositions of the present information
$H[X_{0:\ell}]$: (a) decomposition due to the past; (b) decomposition
due to the past and the future.

The process information diagram makes it rather transparent in which
sense $r_\mu^\ell$ is an amount of ephemeral information: its
information lies outside both the past and future and so it exists
only in the present moment. It has no repercussions for the future and
is no consequence of the past. It is the amount of information in the
present observation neither communicated to the future nor from the
past. With $\ell = 1$, this has been referred to as the residual
entropy rate, as it is the amount of uncertainty that remains in the
present even after accounting for every other variable in the time
series. It has also been studied as the erasure information [20]
(there $H^-$), as it is the information irrecoverably erased in a
binary erasure channel.

The bound information $b_\mu^\ell$ is the amount of spontaneously
generated information present now, not explained by the past, but that
has consequences for the future. In this sense it hints at being a
measure of structural complexity [13], though we discuss more direct
measures of structure shortly.

Due to stationarity, the mutual information
$I[X_{0:\ell} : X_{\ell:} | X_{:0}]$ between the present $X_{0:\ell}$
and the future $X_{\ell:}$ conditioned on the past $X_{:0}$ is the
same as the mutual information $I[X_{0:\ell} : X_{:0} | X_{\ell:}]$
between the present $X_{0:\ell}$ and the past $X_{:0}$ conditioned on
the future $X_{\ell:}$. Therefore they are both of size $b_\mu^\ell$,
as shown in Fig. 1. This lends a symmetry to the process information
diagram that need not exist for nonstationary processes.

D. Elusivity 

Two components remain in the process information diagram, two that
have not been significantly analyzed previously. The first is
$q_\mu^\ell = I[X_{:0} : X_{0:\ell} : X_{\ell:}]$, the three-way
mutual information (or co-information) shared by the past, present,
and future. Notably, unlike Shannon entropies and two-way mutual
information, $q_\mu^\ell$ (and co-informations in general) can be
negative. The other component,
$\sigma_\mu^\ell = I[X_{:0} : X_{\ell:} | X_{0:\ell}]$, the quantity
of primary interest here (shaded in Fig. 1), is the information shared
between the past and the future that does not exist in the present.
Since it measures dependency hidden from the present, we call it the
elusive information or elusivity for short. Generally, it indicates
that a process has hidden structures that are not appropriately
captured by finite random-variable blocks. In this case, and as we
discuss at length towards the end, one must build models whose
elements, which we call “states” below, represent how a process’s
internal mechanism is organized.

A process’s internal organization somehow must store all the
information from the past that is relevant for generating the future
behavior. Only when the observed process is Markovian is it sufficient
to keep track of just the current observable or block of observables.
For the general case of non-Markovian processes, though, information
relevant for prediction is spread arbitrarily far back in the
process’s history and so cannot be captured by the present regardless
of its duration. This fact is reflected in the existence of
$\sigma_\mu^\ell$. When $\sigma_\mu^\ell > 0$ for all $\ell$, the
description of the process requires determining its internal
organization. This is one reason to build a model of the mechanism
that generates sequences rather than simply describe a process as a
list of sequences.

There are two basic properties that indicate the elusivity’s
importance. The first is that $\sigma_\mu^\ell$ decreases
monotonically as a function of the present’s length $\ell$. That is,
dependency cannot increase if we interpolate more random variables
between the past and future.

Proposition 1. $\sigma_\mu^{\ell} \geq \sigma_\mu^{\ell'}$, if $\ell' > \ell$.

The second property is that it indicates how poorly the present
$X_{0:\ell}$ shields the past $X_{:0}$ and future $X_{\ell:}$. When
the present does shield them, they are conditionally independent,
given the present, and $\sigma_\mu^\ell$ vanishes. Due to this, it can
be used to detect a process’s Markov order $R$: the smallest $R$ for
which $\Pr(X_0 | X_{:0}) = \Pr(X_0 | X_{-R:0})$.

Proposition 2. $\sigma_\mu^\ell = 0 \iff \ell \geq R$.

Proofs are given in the sequel.

To calculate $\sigma_\mu^\ell$, recall its definition as a conditional
mutual information:

$$\sigma_\mu^\ell = I[X_{:0} : X_{\ell:} | X_{0:\ell}] = \sum_{x_{:0},\, x_{0:\ell},\, x_{\ell:}} \Pr(x_:) \log_2 \frac{\Pr(x_{:0}, x_{\ell:} | x_{0:\ell})}{\Pr(x_{:0} | x_{0:\ell}) \Pr(x_{\ell:} | x_{0:\ell})} ,$$

where we used the notational shorthand for the bi-infinite joint
distribution $\Pr(x_:) = \Pr(x_{:0}, x_{0:\ell}, x_{\ell:})$.

Note that for an order-$R$ Markov process, if $\ell \geq R$ the past
and the future are independent over range $R$ [15] and so
$\Pr(X_{:0}, X_{\ell:} | X_{0:\ell}) = \Pr(X_{:0} | X_{0:\ell}) \Pr(X_{\ell:} | X_{0:\ell})$.
With this, it is clear that $\sigma_\mu^\ell$ vanishes in such cases.
This proposition has also been discussed in prior literature.

Anticipating the needs of our calculations later, we replace
conditional distributions with the joint ones:
$\Pr(x_{:0}, x_{\ell:} | x_{0:\ell}) = \Pr(x_:) / \Pr(x_{0:\ell})$ and
$\Pr(x_{:0} | x_{0:\ell}) = \Pr(x_{:0}, x_{0:\ell}) / \Pr(x_{0:\ell})$,
obtaining:

$$\sigma_\mu^\ell = \sum_{x_{:0},\, x_{0:\ell},\, x_{\ell:}} \Pr(x_:) \log_2 \frac{\Pr(x_{0:\ell}) \Pr(x_:)}{\Pr(x_{:0}, x_{0:\ell}) \Pr(x_{0:\ell}, x_{\ell:})} . \tag{10}$$

Notably, all the terms needed to compute $\sigma_\mu^\ell$ are either
$\Pr(x_{:0}, x_{0:\ell}, x_{\ell:})$ or marginals thereof. Our next
goal, therefore, is to develop the theoretical infrastructure
necessary to compute that distribution in closed form.

Similar expressions, which we use later on but do not record here, can
be developed for the other information measures $h_\mu^\ell$,
$r_\mu^\ell$, $b_\mu^\ell$, and $q_\mu^\ell$.
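Before any closed-form machinery, the elusivity can be approximated by brute force with finite windows. For a stationary process, the windowed quantity $I[X_{-m:0} : X_{\ell:\ell+n} | X_{0:\ell}]$ equals $H(m+\ell) + H(\ell+n) - H(\ell) - H(m+\ell+n)$ in terms of block entropies. The sketch below evaluates this for two assumed two-state hidden Markov presentations (a Golden Mean-like and an Even-like process). The matrices, function names, and the lazy-chain iteration used for the stationary distribution are choices made for this illustration, not the method of the text.

```python
from itertools import product
from math import log2

def left_multiply(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i M_ij."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def stationary(T, iters=500):
    """Approximate pi = pi T, T = sum_x T^(x), by iterating the lazy chain (I + T)/2."""
    n = len(next(iter(T.values())))
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [0.0] * n
        for Tx in T.values():
            step = [a + b for a, b in zip(step, left_multiply(pi, Tx))]
        pi = [(a + b) / 2 for a, b in zip(pi, step)]
    return pi

def block_entropy(T, length):
    """H(l) in bits, enumerating all length-l words with Pr(w) = pi T^(w) 1."""
    pi = stationary(T)
    total = 0.0
    for word in product(T.keys(), repeat=length):
        v = pi
        for x in word:
            v = left_multiply(v, T[x])
        p = sum(v)
        if p > 0:
            total -= p * log2(p)
    return total

def windowed_elusivity(T, l, m, n):
    """I[X_{-m:0} : X_{l:l+n} | X_{0:l}] via block entropies (relies on stationarity)."""
    H = lambda k: block_entropy(T, k)
    return H(m + l) + H(l + n) - H(l) - H(m + l + n)

# Assumed symbol-labeled transition matrices, states ordered (A, B).
GOLDEN_MEAN = {"0": [[0.5, 0.0], [1.0, 0.0]], "1": [[0.0, 0.5], [0.0, 0.0]]}
EVEN = {"0": [[0.5, 0.0], [0.0, 0.0]], "1": [[0.0, 0.5], [1.0, 0.0]]}

print(windowed_elusivity(GOLDEN_MEAN, 1, 3, 3))  # order-1 Markov: numerically zero
print(windowed_elusivity(EVEN, 1, 3, 3))         # not finite-order Markov: positive
```

The windowed value is a lower bound on $\sigma_\mu^\ell$ that grows as $m$ and $n$ increase, and the enumeration cost grows exponentially with the window size. This is the practical motivation for the state-based expressions developed next.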

III. STRUCTURAL COMPLEXITY 

To analytically calculate the elusive information $\sigma_\mu^\ell$ we
must go beyond the information theory of sequences and introduce
computational mechanics, the theory of process structure. The
representation it uses for a given process is a form of hidden Markov
model (HMM) [22]: the $\epsilon$-machine, which consists of a set
$\mathcal{S}$ of causal states and a transition dynamic $T$.
$\epsilon$-Machines satisfy three conditions: irreducibility,
unifilarity, and probabilistically distinct states. Irreducibility
implies that the associated state-transition graph is strongly
connected. Unifilarity, perhaps the most distinguishing feature, means
that for each state $\sigma \in \mathcal{S}$ and each observed symbol
$x \in \mathcal{A}$ there is at most one outgoing transition from
$\sigma$ labeled $x$. Critically, unifilarity enables one to directly
calculate various process quantities, such as conditional mutual
informations, using properties of the hidden states. Notably, many of
these quantities cannot be directly calculated using the states of
general (nonunifilar) HMMs. Finally, an HMM has probabilistically
distinct states when, for every pair of states $\sigma$ and $\sigma'$,
there exists a word $w$ such that the probability of observing $w$
from each state is distinct: $\Pr(w | \sigma) \neq \Pr(w | \sigma')$.
An irreducible, unifilar model with probabilistically distinct states
is minimal in the sense that no model with fewer states or transitions
generates the process. An HMM satisfying these three properties is an
$\epsilon$-machine.

A. Constructing the e-Machine 

Given a process, how does one construct its $\epsilon$-machine? First,
a process’s set of forward causal states:

$$\mathcal{S}^+ = X_{:t} / \sim^+ \tag{11}$$

is the partition defined via the causal equivalence relation:

$$x_{:t} \sim^+ x'_{:t} \iff \Pr(X_{t:} | X_{:t} = x_{:t}) = \Pr(X_{t:} | X_{:t} = x'_{:t}) . \tag{12}$$

That is, each causal state $\sigma^+ \in \mathcal{S}^+$ is an element
of the coarsest partition of a process’s pasts such that every
$x_{:0} \in \sigma^+$ makes the same prediction
$\Pr(X_{0:} | \cdot)$. In fact, the causal states are the minimal
sufficient statistic of the past to predict the future. We define the
reverse causal states:

$$\mathcal{S}^- = X_{t:} / \sim^- \tag{13}$$

by similarly partitioning the process’s futures:

$$x_{t:} \sim^- x'_{t:} \iff \Pr(X_{:t} | X_{t:} = x_{t:}) = \Pr(X_{:t} | X_{t:} = x'_{t:}) . \tag{14}$$

FIG. 3. Mutual information $I[X_{:0} : X_{0:\ell}]$ between the past
and the present (shaded) is equivalent to the mutual information
$I[\mathcal{S}_0^+ : X_{0:\ell}]$ between the forward causal state and
the present.

Second, the causal equivalence relation provides a natural unifilar
dynamic over the states. For each state $\sigma$ and next symbol $x$,
either there is a successor state $\sigma'$ such that the updated past
$x_{:t+1} = x_{:t} x \in \sigma'$ for all $x_{:t} \in \sigma$, or
$x_{:t+1}$ does not occur. Due to causal-state equivalence, every past
within a state collectively either can or cannot be followed by a
given symbol. Moreover, since the causal states form a partition of
all pasts, there is at most one causal state to which each past can
advance.

For an HMM with states $\rho \in \mathcal{R}$, its symbol-labeled
transition matrix elements are the probabilities of going from state
$\rho$ to state $\rho'$ and generating the symbol $x$:

$$T^{(x)}_{\rho \rho'} = \Pr(X_t = x, \mathcal{R}_{t+1} = \rho' | \mathcal{R}_t = \rho) . \tag{15}$$

Furthermore, the internal-state dynamics is governed by the stochastic
matrix $T = \sum_x T^{(x)}$. Its unique left eigenvector $\pi$,
associated with eigenvalue 1, gives the asymptotic state probability
$\Pr(\rho)$. By extension, the transition matrix giving the
probability of a word $w = x_0 x_1 \cdots x_{\ell-1}$ of length $\ell$
is the product of the transition matrices of each symbol in $w$:

$$T^{(w)} = \prod_{x_i \in w} T^{(x_i)} = T^{(x_0)} T^{(x_1)} \cdots T^{(x_{\ell-1})} . \tag{16}$$
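B. Rendering $\sigma_\mu^\ell$ Finitely Computable

We can put the forward and reverse causal states to use since they are
proxies for a process’s semi-infinite pasts and futures, respectively.
See, e.g., Fig. 3. In this way, we transform the elusivity’s
definition into a form containing only finite sets of random
variables. We calculate directly:

$$
\begin{aligned}
\sigma_\mu^\ell &= I[X_{:0} : X_{\ell:} | X_{0:\ell}] \\
&= I[X_{:0} : (X_{0:\ell}, X_{\ell:})] - I[X_{:0} : X_{0:\ell}] \\
&= I[X_{:0} : X_{0:}] - I[X_{:0} : X_{0:\ell}] \\
&\overset{(a)}{=} I[\mathcal{S}_0^+ : \mathcal{S}_0^-] - I[\mathcal{S}_0^+ : X_{0:\ell}] \\
&= I[\mathcal{S}_0^+ : \mathcal{S}_0^-] - \left( I[\mathcal{S}_0^+ : X_{0:\ell} : \mathcal{S}_0^-] + I[\mathcal{S}_0^+ : X_{0:\ell} | \mathcal{S}_0^-] \right) \\
&\overset{(b)}{=} I[\mathcal{S}_0^+ : \mathcal{S}_0^-] - I[\mathcal{S}_0^+ : X_{0:\ell} : \mathcal{S}_0^-] \\
&= I[\mathcal{S}_0^+ : \mathcal{S}_0^- | X_{0:\ell}] \\
&= I[\mathcal{S}_0^+ : \mathcal{S}_0^- : \mathcal{S}_\ell^- | X_{0:\ell}] + I[\mathcal{S}_0^+ : \mathcal{S}_0^- | X_{0:\ell}, \mathcal{S}_\ell^-] \\
&\overset{(c)}{=} I[\mathcal{S}_0^+ : \mathcal{S}_0^- : \mathcal{S}_\ell^- | X_{0:\ell}] \\
&= I[\mathcal{S}_0^+ : \mathcal{S}_\ell^- | X_{0:\ell}] - I[\mathcal{S}_0^+ : \mathcal{S}_\ell^- | X_{0:\ell}, \mathcal{S}_0^-] \\
&= I[\mathcal{S}_0^+ : \mathcal{S}_\ell^- | X_{0:\ell}] - \left( H[\mathcal{S}_0^+ | X_{0:\ell}, \mathcal{S}_0^-] - H[\mathcal{S}_0^+ | \mathcal{S}_0^-, X_{0:\ell}, \mathcal{S}_\ell^-] \right) \\
&\overset{(d)}{=} I[\mathcal{S}_0^+ : \mathcal{S}_\ell^- | X_{0:\ell}] .
\end{aligned}
\tag{17}
$$

Above, (a) is true due to Eqs. (12) to (13) and Ref. [21]; (b) is true
due to Eqs. (13) and (14); (c) is true due to Eq. (14) and
unifilarity; and finally (d) is true due to both entropy terms being
equal to $H[\mathcal{S}_0^+ | \mathcal{S}_0^-]$ by Eqs. (13) and (14).
That is, $\mathcal{S}_0^-$ informationally subsumes both $X_{0:\ell}$
and $\mathcal{S}_\ell^-$ when it comes to $X_{:0}$ and, therefore,
also when it comes to $\mathcal{S}_0^+$. All other equalities are
basic information identities.

In this way, Eq. (17) says that Eq. (10) becomes, in terms of causal
states, a new expression for elusivity:

$$\sigma_\mu^\ell = \sum_{\sigma_0^+ \in \mathcal{S}^+,\, x_{0:\ell},\, \sigma_\ell^- \in \mathcal{S}^-} \Pr(\sigma_0^+, x_{0:\ell}, \sigma_\ell^-) \log_2 \frac{\Pr(x_{0:\ell}) \Pr(\sigma_0^+, x_{0:\ell}, \sigma_\ell^-)}{\Pr(\sigma_0^+, x_{0:\ell}) \Pr(x_{0:\ell}, \sigma_\ell^-)} . \tag{18}$$

We transformed the key distribution
$\Pr(x_{:0}, x_{0:\ell}, x_{\ell:})$ over random variables $X_{:0}$
and $X_{\ell:}$ with cardinality of the continuum to
$\Pr(\sigma_0^+, x_{0:\ell}, \sigma_\ell^-)$ over $\mathcal{S}^+$ and
$\mathcal{S}^-$ with typically smaller cardinality. When the causal
states are finite or countably infinite, the benefit is substantial.
We will now turn our attention to computing this joint distribution.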

Since the distribution is over both forward and reverse causal states,
we must track both simultaneously. The key tool for this is
Ref. [25]’s bidirectional machine or bimachine. We point the reader
there for details regarding their construction and properties. One
feature we need immediately, though, is that bimachine states
$\rho_t = (\sigma_t^+, \sigma_t^-)$ are pairs of forward and reverse
causal states.

FIG. 4. Random variable lattice illustrating the relationship between
forward causal states $\mathcal{S}_t^+$, observed symbols $X_t$, and
reverse causal states $\mathcal{S}_t^-$. The variables in the
distribution $\Pr(\mathcal{S}_{-1}^+, X_{-1:2}, \mathcal{S}_2^-)$ are
highlighted. In particular, elusivity $\sigma_\mu^3$ is the mutual
information between the two shaded cells ($\mathcal{S}_{-1}^+$ and
$\mathcal{S}_2^-$) conditioned on the hatched cells
($X_{-1:2} = X_{-1} X_0 X_1$).

Generally, given an HMM with states $\rho \in \mathcal{R}$, we can
construct the distribution of interest if we can find a way to build
distributions of the form $\Pr(\rho_i, w, \rho_j)$: the probability of
being in state $\rho_i$, generating the word $w$, and ending in state
$\rho_j$. The word transition matrix, Eq. (16), gives exactly this and
allows us to build the distribution directly:

$$\Pr(\rho_i, w, \rho_j) = (\pi \circ \mathbf{1}_i) \, T^{(w)} \, \mathbf{1}_j^{\top} , \tag{19}$$

where $\rho_i$ and $\rho_j$ are the states of an arbitrary HMM,
$a \circ b$ is the Hadamard (elementwise) product of vectors $a$ and
$b$, and $\mathbf{1}_i$ is the row vector with all its elements zero
except for the $i$th, which is 1.

Applying Eq. (19) to the bimachine, we arrive at the distribution
$\Pr((\mathcal{S}_0^+, \mathcal{S}_0^-), X_{0:\ell}, (\mathcal{S}_\ell^+, \mathcal{S}_\ell^-))$,
which can be marginalized to
$\Pr(\mathcal{S}_0^+, X_{0:\ell}, \mathcal{S}_\ell^-)$, the
distribution needed to compute Eq. (18). Figure 4 illustrates this
distribution in the setting of the process’s random variable lattice
and the forward and reverse causal state processes.
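A minimal sketch of Eq. (19) in exact rational arithmetic may help fix ideas. It is applied here to an assumed two-state forward presentation of the Golden Mean Process rather than to the bimachine, so it only illustrates the mechanics of the formula. The matrices, the state ordering (A, B), and the helper names are hypothetical choices for this example.

```python
from fractions import Fraction as F
from itertools import product

# Assumed forward presentation of the Golden Mean Process, states ordered (A, B).
T = {
    "0": [[F(1, 2), F(0)], [F(1), F(0)]],
    "1": [[F(0), F(1, 2)], [F(0), F(0)]],
}
PI = [F(2, 3), F(1, 3)]  # stationary state distribution, checked below

def word_matrix(T, word):
    """T^(w) = T^(x_0) T^(x_1) ... T^(x_{l-1}); the identity for the empty word."""
    n = len(PI)
    M = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for x in word:
        M = [[sum(M[i][k] * T[x][k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return M

def joint(i, word, j):
    """Pr(rho_i, w, rho_j) = (pi o 1_i) T^(w) 1_j^T, which reduces to pi_i * T^(w)[i][j]."""
    return PI[i] * word_matrix(T, word)[i][j]

# Check stationarity: pi (T^(0) + T^(1)) = pi.
assert [sum(PI[i] * (T["0"][i][j] + T["1"][i][j]) for i in range(2))
        for j in range(2)] == PI

# Start state, one emitted symbol, end state.
print(joint(0, "0", 0), joint(1, "0", 0), joint(0, "1", 1))
```

Summing `joint` over start and end states for a fixed word gives that word's probability, and summing over all words of a fixed length gives one. Both are quick sanity checks before marginalizing a bimachine distribution in the same way.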


C. Companion Atoms 


Causal-state expressions for $h_\mu^\ell$, $r_\mu^\ell$, $b_\mu^\ell$,
and $q_\mu^\ell$ that we use in the following are:

$$
\begin{aligned}
h_\mu^\ell &= H[X_{0:\ell} | \mathcal{S}_0^+] , \\
r_\mu^\ell &= H[X_{0:\ell} | \mathcal{S}_0^+, \mathcal{S}_\ell^-] , \\
b_\mu^\ell &= I[X_{0:\ell} : \mathcal{S}_\ell^- | \mathcal{S}_0^+] , \text{ and} \\
q_\mu^\ell &= I[\mathcal{S}_0^+ : X_{0:\ell} : \mathcal{S}_\ell^-] .
\end{aligned}
$$

These are derived in ways paralleling that above for $\sigma_\mu^\ell$
and so we do not give details. They, too, depend on the joint
distribution above in Eq. (19) and its marginals.

FIG. 5. The several faces of the Golden Mean (GM) Process: (a) forward
$\epsilon$-machine; (b) reverse $\epsilon$-machine.


IV. EXAMPLES 

Let’s consider several example processes to illustrate calculation
methods and to examine the behavior of $\sigma_\mu^\ell$ and companion
measures.


A. Golden Mean Process

As the first example we analyze the Golden Mean (GM) Process, whose
$\epsilon$-machines and bimachine state-transition diagrams are given
in Fig. 5. The GM Process consists of all bi-infinite strings such
that no consecutive 1s occur, with probabilities such that either
symbol is equally likely following a 0. A stochastic generalization of
subshifts of finite type [28], this process can be described by a
Markov chain with order $R = 1$. Due to Prop. 2 we expect
$\sigma_\mu^1 = 0$. To verify this, we compute each term of Eq. (18)
using the edges of the bimachine, Fig. 5c, and the invariant state
distribution $\pi = (1/3, 1/3, 1/3)$:

$$
\begin{aligned}
\frac{\Pr(0) \Pr(A, 0, C)}{\Pr(A, 0) \Pr(0, C)} &= \frac{2/3 \cdot 1/6}{1/3 \cdot 1/3} , \\
\frac{\Pr(0) \Pr(A, 0, D)}{\Pr(A, 0) \Pr(0, D)} &= \frac{2/3 \cdot 1/6}{1/3 \cdot 1/3} , \\
\frac{\Pr(0) \Pr(B, 0, C)}{\Pr(B, 0) \Pr(0, C)} &= \frac{2/3 \cdot 1/6}{1/3 \cdot 1/3} , \\
\frac{\Pr(0) \Pr(B, 0, D)}{\Pr(B, 0) \Pr(0, D)} &= \frac{2/3 \cdot 1/6}{1/3 \cdot 1/3} , \\
\frac{\Pr(1) \Pr(A, 1, C)}{\Pr(A, 1) \Pr(1, C)} &= \frac{1/3 \cdot 1/3}{1/3 \cdot 1/3} .
\end{aligned}
$$

We see that the argument of each $\log_2$ in Eq. (18) is 1, confirming
that $\sigma_\mu^1 = 0$.


B. Information Measures versus Present Length

We now investigate the behavior of $\sigma_\mu^\ell$ and its
companions $q_\mu^\ell$, $b_\mu^\ell$, and $r_\mu^\ell$ for several
example processes: the aforementioned GM, the Even, the Noisy Period
Three (NP3), and the Noisy Random Phase-Slip (NRPS) Processes. The
$\epsilon$-machines for the latter three are shown in Fig. 6. Each
exhibits different convergence behaviors with $\ell$ for the differing
measures; see the graphs in Fig. 7. We now turn to characterizing each
of them.

FIG. 6. $\epsilon$-Machines for the example processes: (a) the Even
Process; (b) the Noisy Period Three (NP3) Process; (c) the Noisy
Random Phase-Slip (NRPS) Process.

We first consider $\sigma_\mu^\ell$, seen in Fig. 7’s upper-left
panel. While for each process $\sigma_\mu^\ell$ vanishes with
increasing $\ell$, the convergence behaviors differ. The Golden Mean
Process is identically zero at all lengths due to its order-1 Markov
nature, just noted. The NRPS Process, with a Markov order of $R = 5$,
has nonzero $\sigma_\mu^\ell$ for $\ell < 5$ and
$\sigma_\mu^\ell = 0$ beyond. Finally, both the Even and Nemo
Processes are infinite-order Markov and so their $\sigma_\mu^\ell$
never exactly vanish, though they converge exponentially fast. The
next section, Section IV C, discusses exponential convergence in more
detail.

Next, consider $q_\mu^\ell$ and $b_\mu^\ell$ which, as it turns out,
are closely associated. To see why, first examine the large-$\ell$
limit:

$$\lim_{\ell \to \infty} I[X_{0:\ell} : X_{\ell:}] = E = q_\mu^\infty + b_\mu^\infty .$$

Second, we can decompose stationary, finite-$E$ processes into two
classes: those with state mixing and those without. State mixing
refers to the convergence of initial state distributions to a unique
invariant distribution, one that in particular does not oscillate. The
GM, Even, and NRPS Processes are examples of those with state mixing,
while the NP3 Process’s asymptotic state distribution is period-3,
when starting from typical initial distributions. With state mixing,
$X_{:0}$ and $X_{\ell:}$ become independent in the infinite-$\ell$
limit, and so $q_\mu^\infty = 0$ and we conclude
$b_\mu^\infty = E$. That is, the entire contribution to excess entropy
comes from the bound information. Without state mixing, though,
$b_\mu^\infty = 0$, and so $q_\mu^\infty = E$.

These two classes of convergence behavior are apparent when comparing
Fig. 7’s upper-right and lower-left panels. In the upper-right,
$q_\mu^\ell$ converges to zero for the GM, Even, and NRPS Processes.
The NP3 Process, in contrast, limits to a constant value: its excess
entropy $E = \log_2 p$, where $p = 3$ is the period of the internal
state cycle. In the lower-left, $b_\mu^\ell$ limits to constant values
for the three state-mixing processes, while the NP3 Process’s
$b_\mu^\ell$ is zero for all $\ell$.

Finally, the ephemeral information $r_\mu^\ell$, plotted in Fig. 7’s
lower-right panel, also depends on whether a process is state mixing
or not. If it is, then:

$$H(\ell) = q_\mu^\ell + 2 b_\mu^\ell + r_\mu^\ell = 0 + 2E + r_\mu^\ell .$$

That is, $r_\mu^\ell = -E + \ell h_\mu$. In the case of no state
mixing, though, we have:

$$H(\ell) = q_\mu^\ell + 2 b_\mu^\ell + r_\mu^\ell = E + 0 + r_\mu^\ell .$$

That is, $r_\mu^\ell = \ell h_\mu$. So, in either case, $r_\mu^\ell$
grows linearly asymptotically with a rate of $h_\mu$. If there is
state mixing, however, it has a subextensive part equal to $-E$.

FIG. 7. Information measures as a function of the present’s length
$\ell$, with curves for the Golden Mean, Even, NP3, and NRPS
Processes. Since the examples are stationary, finitary processes, both
$\sigma_\mu^\ell$ and $q_\mu^\ell$ converge to zero with increasing
$\ell$ and $b_\mu^\ell$ converges to a constant value of $E$. And so,
the growth of $H(\ell)$ is entirely captured in $r_\mu^\ell$, and it
grows linearly with $\ell$.

C. Exponential Convergence of $\sigma_\mu^\ell$

One way to classify processes is whether or not an observer can
determine the causal state a process is in from finite or infinite
sequence measurements. If so, then the process is synchronizable. All
of the previous examples are synchronizable. It has been proved that
for any synchronizable process described by a finite-state HMM, there
exist constants $K > 0$ and $0 < \alpha < 1$ such that:

$$h_\mu(\ell) - h_\mu \leq K \alpha^\ell , \text{ for all } \ell \in \mathbb{N} , \tag{20}$$

where:

$$h_\mu(\ell) = H(\ell) - H(\ell - 1) .$$

Note that $h_\mu(1) - h_\mu = \rho_\mu^1$. One well known identity
[11] is that the sum of the $h_\mu(\ell) - h_\mu$ terms is the excess
entropy:

$$E = \sum_{k=1}^{\infty} \left( h_\mu(k) - h_\mu \right) \tag{21}$$
$$= \rho_\mu^1 + \sum_{k=2}^{\infty} \left( h_\mu(k) - h_\mu \right) . \tag{22}$$

This provides a new identity:

$$\sigma_\mu^1 = \sum_{k=2}^{\infty} \left( h_\mu(k) - h_\mu \right) , \tag{23}$$

which can be generalized to:

$$\sigma_\mu^\ell = \sum_{k=\ell+1}^{\infty} \left( h_\mu(k) - h_\mu \right) . \tag{24}$$

Applying the bound from Eq. (20) to each term, we find:

$$\sum_{k=\ell+1}^{\infty} \left( h_\mu(k) - h_\mu \right) \leq \sum_{k=\ell+1}^{\infty} K \alpha^k .$$

The right-hand side, being a convergent geometric series, yields:

$$\sigma_\mu^\ell \leq \frac{K \alpha^{\ell+1}}{1 - \alpha} ,$$

or simply:

$$\sigma_\mu^\ell \leq K' \alpha^\ell . \tag{25}$$

We now drop the prime, simplifying the form. Thus, the elusive
information vanishes exponentially fast for synchronizable processes.

Figure 8 compares $\sigma_\mu^\ell$ with its best-fit exponential
bound for two different processes: the Nemo Process shown in Fig. 9
and the Even Process. For each, the solid line is $\sigma_\mu^\ell$
and the dashed is the fit. Estimated values for the Nemo Process are
$K = 0.6$ and $\alpha = 0.64$. The fit parameters for the Even Process
are $K = 1.0$ and $\alpha = 0.71$. They were estimated in accordance
with the conditions stated for Eq. (20). The fits validate the
convergence in Eq. (25).

FIG. 8. $\sigma_\mu^\ell$ in bits (solid) and its asymptote (dashed)
versus the length of the present for (a) Nemo Process with $K = 0.6$
and $\alpha = 0.64$ and (b) Even Process with $K = 1$ and
$\alpha = 0.71$.

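As a rough numerical illustration of the tail-sum form of the elusive information, the sketch below evaluates block entropies for the Even Process and sums $h_\mu(k) - h_\mu$ from $k = \ell + 1$ up to a finite cutoff. It is not the closed-form method of this paper: the two-state presentation of the Even Process, the names `T`, `PI`, `H_MU`, `block_entropies`, and `elusive_information`, and the cutoff of 20 are all assumptions made for illustration, and truncating the infinite sum slightly underestimates each value.

```python
import math
import numpy as np

# Assumed presentation of the Even Process as a two-state HMM
# (state 0 = A, state 1 = B).
# T[x][i, j] = Pr(emit symbol x and move to state j | current state i).
T = {
    0: np.array([[0.5, 0.0],
                 [0.0, 0.0]]),
    1: np.array([[0.0, 0.5],
                 [1.0, 0.0]]),
}
PI = np.array([2.0 / 3.0, 1.0 / 3.0])  # stationary state distribution
H_MU = 2.0 / 3.0                        # entropy rate in bits: pi_A * H(1/2)


def block_entropies(max_len):
    """Return [H(0), H(1), ..., H(max_len)] in bits."""
    entropies = [0.0]
    vecs = [PI]  # one unnormalized state vector per allowed word
    for _ in range(max_len):
        # Extend every allowed word by one symbol.
        vecs = [v @ T[x] for v in vecs for x in (0, 1)]
        # Forbidden words have probability exactly zero; drop them.
        vecs = [v for v in vecs if v.sum() > 0.0]
        entropies.append(-sum(v.sum() * math.log2(v.sum()) for v in vecs))
    return entropies


def elusive_information(ell, H):
    """Truncated tail sum: sum over k = ell+1 .. len(H)-1 of h_mu(k) - h_mu."""
    return sum((H[k] - H[k - 1]) - H_MU for k in range(ell + 1, len(H)))


H = block_entropies(20)
sigma = {ell: elusive_information(ell, H) for ell in range(1, 11)}
for ell, value in sigma.items():
    print(f"ell = {ell:2d}   sigma ~ {value:.4f} bits")
```

The printed values can be set beside $K \alpha^\ell$ using the fit parameters quoted for the Even Process to see the exponential decay; the truncation error shrinks as the cutoff grows.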
V. MEASURES OF EMERGENCE? 

The elusive information $\sigma_\mu^\ell$ conditions on a present of
length $\ell$. What if we do not condition, simply ignoring the
present? It becomes the persistent mutual information
(PMI) jHHS]: 


$$ \mathrm{PMI}(\ell) = I[X_{:0} : X_{\ell:}] . \tag{26} $$

Notably, PMI($\infty$) was offered up as a measure of “emergence”
in general complex systems. Our preceding analysis,
though, gives a more nuanced view of this interpretation,
especially when emergence is considered in light
of structural criteria introduced previously 130 ED- Our 
framework reveals that PMI($\ell$) is not an atomic measure;
rather it consists of two familiar components:

$$ \mathrm{PMI}(\ell) = q_\mu^\ell + \sigma_\mu^\ell . \tag{27} $$

Which component is most important? Are both? Which 
is associated with emergence? Both? 
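
The two-component form of Eq. (27) can be read as the standard split of a mutual information into a three-way co-information and a conditional mutual information. The sketch below assumes the notation $X_{:0}$ for the past, $X_{0:\ell}$ for the present, and $X_{\ell:}$ for the future, and assumes that $q_\mu^\ell$ and $\sigma_\mu^\ell$ are defined as the first and second terms respectively:

```latex
\begin{align*}
\mathrm{PMI}(\ell) = I[X_{:0} : X_{\ell:}]
  &= I[X_{:0} : X_{\ell:} : X_{0:\ell}] + I[X_{:0} : X_{\ell:} \mid X_{0:\ell}] \\
  &= q_\mu^\ell + \sigma_\mu^\ell .
\end{align*}
```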

Section IV C showed that synchronizable processes have
$\sigma_\mu^\ell \to 0$. So, for this broad class at least, PMI($\infty$) = $q_\mu^\infty$.
Based on extensive process surveys that we do not report
on here, we conjecture that $\sigma_\mu^\infty = 0$ holds even more
generally. And so, it appears that PMI($\infty$) generally is
dominated by the multivariate mutual information $q_\mu^\infty$.
Moreover, recalling the analysis of $q_\mu^\ell$ for the Noisy Period-3
Process shown in the upper-right panel of Fig. 7, it
appears that PMI($\infty$) is only sensitive to periodicity or
noisy periodicity, giving $\log_2 p$, where $p$ is the period.

As a test of our conjecture that the elusive information
vanishes and that PMI($\infty$) is dominated by $q_\mu^\infty$, we
applied our information-measure estimation methods to
the symbolic dynamics generated by the Logistic Map of
the unit interval as a function of its control parameter $r$.
Figure 10 plots the results. Indeed, the elusive information
does vanish. Thus, we conclude that PMI($\infty$) is a
property of $q_\mu^\infty$.
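
To make the Logistic Map experiment concrete, here is a minimal sketch of one way to probe the same question numerically. It is not the estimation method used for Fig. 10: it assumes a binary partition of the unit interval at $x = 1/2$, and it uses a plug-in estimate of the mutual information between two finite blocks separated by a gap as a crude finite stand-in for PMI($\infty$). The function names, block length, gap, and sample size are hypothetical choices.

```python
import math
from collections import Counter


def logistic_symbols(r, n, x0=0.3, burn_in=1000):
    """Binary symbolic dynamics of x -> r x (1 - x), partitioned at x = 1/2."""
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = r * x * (1.0 - x)
    symbols = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        symbols.append(1 if x >= 0.5 else 0)
    return symbols


def entropy(counts):
    """Plug-in Shannon entropy in bits of an empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def block_mutual_information(symbols, block, gap):
    """Plug-in mutual information between two length-`block` words
    separated by `gap` ignored symbols."""
    past, future, joint = Counter(), Counter(), Counter()
    span = 2 * block + gap
    for i in range(len(symbols) - span + 1):
        p = tuple(symbols[i:i + block])
        f = tuple(symbols[i + block + gap:i + span])
        past[p] += 1
        future[f] += 1
        joint[(p, f)] += 1
    return entropy(past) + entropy(future) - entropy(joint)


mi_periodic = block_mutual_information(logistic_symbols(3.5, 20000), block=3, gap=5)
mi_chaotic = block_mutual_information(logistic_symbols(4.0, 20000), block=3, gap=5)
print(f"r = 3.5: {mi_periodic:.3f} bits")
print(f"r = 4.0: {mi_chaotic:.3f} bits")
```

At $r = 3.5$ the map settles onto a period-4 orbit, so the estimate should sit near $\log_2 4 = 2$ bits, while at $r = 4$ the symbol sequence is effectively uncorrelated and the estimate should be near zero up to finite-sample bias.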

In addition, our simulation results reproduced those in
Ref. [5]’s PMI($\infty$) analysis of the Logistic Map; though,
their estimation method for PMI($\infty$) differs markedly.
Here, we calculate via the Logistic Map symbolic
dynamics; there, joint distributions over the continuous
unit-interval domain were used. Both investigations lead
to the conclusion that PMI($\infty$) is equal to (the logarithm
of) the number of chaotic “bands” cyclically permuted or
the period of the periodic orbit at a given parameter value.
In short, PMI($\infty$) is a measure of non-mixing dynamics.

Given the restricted form of structure (periodicity) to
which it is sensitive, PMI($\infty$) cannot be taken as a
general measure for detecting the emergence of organization
in complex systems. No matter: though a quarter of a
century old, the statistical complexity m, a direct
measure of structural organization and stored information,
continues to fill the role of detecting emergent
organization quite well. Moreover, computational mechanics’
$\epsilon$-machines directly show what the emergent organization
is.


VI. CONCLUSION 

We first defined the elusive information and developed
a closed-form analytic expression to calculate it from a
process’s hidden Markov model. The sequel m shows
how to use spectral methods m to develop alternative
closed-form expressions for the elusive information and
its companions.

[Figure 10: curves plotted against the map control parameter
$r$, from 3.2 to 4.]

FIG. 10. Persistent mutual information PMI($\infty$), elusivity $\sigma_\mu^\infty$,
and multivariate mutual information $q_\mu^\infty$ of the Logistic Map
symbolic dynamics as a function of map control parameter
$r$. Recall that the symbolic dynamics does not see
period-doubling until $r$ is above the appearance of the associated
superstable periodic orbit. This discrepancy in the appearance
of periodicity as a function of $r$ does not occur when the map
is chaotic. (Cf. Fig. 1 of Ref. [9].)

Investigating how the present shields the past and
future is essentially a study of what Markov order means
for structured processes. It gives some insight into the
process of model building and even general concerns
about emergence in complex systems. First, it is a
common ground on which to contrast structural inference and
emergence, showing that we should not conflate these two
distinct questions. Perhaps most constructively, though,
it sheds light on the challenges of inference for complex
systems. In particular, when $\sigma_\mu^\ell > 0$, sequence statistics
are inadequate for modeling, and so we must use
state-based models to properly, finitely represent a process’s
internal organization.

We showed how present observables typically do not
contain all of the information that correlates the past
and the future. One consequence is that instantaneous
measurements are not enough. This means, exactly, that
Markov chain models of complex physical systems are
fundamentally inadequate, though eminently helpful and
simplifying when they are appropriate representations.
The larger consequence is that we must build state-based
models and not use mere look-up tables or sequence
histograms. And this means, in turn, that monitoring only
prediction performance is inadequate. We must also
monitor model complexity, not as an antidote to over-fitting,
but as a fundamental goal for both prediction and
understanding hidden mechanisms.